- 
                Notifications
    You must be signed in to change notification settings 
- Fork 422
Stop relying on ChannelMonitor persistence after manager read #3322
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stop relying on ChannelMonitor persistence after manager read #3322
Conversation
| A bunch of tests are still trying to call  | 
| Ugh, sorry, fixed. Also rebased and fixed an issue introduced in #3303 | 
5437927    to
    b0fa756      
    Compare
  
    | Codecov ReportAttention: Patch coverage is  
 Additional details and impacted files@@            Coverage Diff             @@
##             main    #3322      +/-   ##
==========================================
- Coverage   89.68%   89.66%   -0.02%     
==========================================
  Files         126      126              
  Lines      103168   103370     +202     
  Branches   103168   103370     +202     
==========================================
+ Hits        92522    92686     +164     
- Misses       7934     7971      +37     
- Partials     2712     2713       +1     
 Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. | 
| Tagging 0.1 because of the first commit. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: fix "insetad" typo in commit message d44066b
| .expect("Failed to get node_id for phantom node recipient"); | ||
| receiver_node_id = phantom_pubkey; | ||
| break; | ||
| let (sources, claiming_payment) = { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to compare the equivalence between the before and after, the scroll distance is 5,000 lines of code. I wonder whether perhaps some of this could be moved into a separate file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
git show --color-moved is your friend here, I think :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And, yea, obviously we should move parts of ChannelManager out, but I don't think this specific logic should be moved given its tie to other stuff in ChannelManager.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay! Still making my way through this.
| return Err(DecodeError::InvalidValue); | ||
| }, | ||
| hash_map::Entry::Vacant(entry) => { | ||
| entry.insert(payment_claim.claiming_payment); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding a panic here hits a bunch of tests, which is good, but for some of them I'm not sure why or if we should be reinserting here. For example, we hit this in test_mpp_keysend on drop of the test Node. In that test, it seems like the preimage should've been pruned from the monitor already. Do you know why this is happening?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We drop preimages one update cycle after we need to. So if a channel completed a claim in a full normal cycle and is "at rest" it'll still have the preimage. Only after the next round of commitment updates will that preimage get dropped. This supports some weird setups where you might have multiple redundant copies of a ChannelMonitor but also its kinda nice. Still, its definitely weird here, it means if a channel is "at rest" we'll always replay a claim event on startup.
With the claim IDs I think its okay, and kinda nice anyway cause it reduces chances of errors on the client side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's useful info. I think it would be good to put some of that on ChannelMonitorImpl::payment_preimages since IIUC that's the reason we hold a Vec<PaymentClaimDetails> there, which implies allowing multiple separate payments to be pending claim for the same payment hash. I was a little confused why we allow duplicate payment hash payments there at first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess that's a part of it? I wasn't really being so deliberate about it - generally we just try not to assume too much about payment hash reuse so it seemed weird to assume it can't happen here. Good to document either way.
| assert!(get_monitor!(nodes[3], chan_id_persisted).get_stored_preimages().contains_key(&payment_hash)); | ||
| assert!(get_monitor!(nodes[3], chan_id_not_persisted).get_stored_preimages().contains_key(&payment_hash)); | ||
|  | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also check that these preimages eventually get removed from the monitors, i.e. that the RAA blocker generated on startup is processed? It might be nice to also adapt Node::drop to check that there are no dangling RAA blockers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also check that these preimages eventually get removed from the monitors, i.e. that the RAA blocker generated on startup is processed?
Done
It might be nice to also adapt Node::drop to check that there are no dangling RAA blockers.
Hmm, I checked, looks like ~75 tests fail if we do it...that might be one for a followup 😅
dabb898    to
    cd4f665      
    Compare
  
    `or_default` is generally less readable than writing out the thing we're writing, as `Default` is opaque but explicit constructors generally are not. Thus, we ignore the clippy lint (ideally we could invert it and ban the use of `Default` in the crate entirely but alas).
In aa09c33 we added a new secret in `ChannelManager` with which to derive inbound `PaymentId`s. We added read support for the new field, but forgot to add writing support for it. Here we fix this oversight.
aa8a0ad    to
    2fd87cf      
    Compare
  
    | Squashed fixups and rebased to tell clippy to shut up. | 
| return Err(DecodeError::InvalidValue); | ||
| }, | ||
| hash_map::Entry::Vacant(entry) => { | ||
| entry.insert(payment_claim.claiming_payment); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's useful info. I think it would be good to put some of that on ChannelMonitorImpl::payment_preimages since IIUC that's the reason we hold a Vec<PaymentClaimDetails> there, which implies allowing multiple separate payments to be pending claim for the same payment hash. I was a little confused why we allow duplicate payment hash payments there at first.
| struct HTLCClaimSource { | ||
| counterparty_node_id: Option<PublicKey>, | ||
| funding_txo: OutPoint, | ||
| channel_id: ChannelId, | ||
| htlc_id: u64, | ||
| } | ||
|  | ||
| impl From<&MPPClaimHTLCSource> for HTLCClaimSource { | ||
| fn from(o: &MPPClaimHTLCSource) -> HTLCClaimSource { | ||
| HTLCClaimSource { | ||
| counterparty_node_id: Some(o.counterparty_node_id), | ||
| funding_txo: o.funding_txo, | ||
| channel_id: o.channel_id, | ||
| htlc_id: o.htlc_id, | ||
| } | ||
| } | ||
| } | ||
|  | ||
| #[derive(Clone, Debug, PartialEq, Eq)] | ||
| struct MPPClaimHTLCSource { | ||
| counterparty_node_id: PublicKey, | ||
| funding_txo: OutPoint, | ||
| channel_id: ChannelId, | ||
| htlc_id: u64, | ||
| } | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add some comments for these structs and clarify there why we need both of them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took me a while to get up to speed on this patch, but I think this is basically there on my end now.
| downstream_funding_outpoint: funding_outpoint, | ||
| downstream_funding_outpoint: _, | ||
| blocking_action: blocker, downstream_channel_id: channel_id, | ||
| } = action { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if this is reachable, and it's pre-existing, but FYI we don't have coverage for getting a MonitorUpdateCompletionAction::FreeOtherChannelImmediately while during_init. If it's unreachable, adding a debug_assert could be useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not entirely convinced its unreachable (and certainly not that it won't become reachable) - on the read end if an HTLC ends up in pending_claims_to_replay we'll call claim_funds_internal which for relayed payments calls claim_funds_from_hop and can return FreeOtherChannelImmediately for duplicate claims. Indeed we should get test coverage but that's not related to this PR :/.
2fd87cf    to
    3e355b3      
    Compare
  
    | LGTM after squash. | 
When we started tracking which channels had MPP parts claimed durably on-disk in their `ChannelMonitor`, we did so with a tuple. This was fine in that it was only ever accessed in two places, but as we will start tracking it through to the `ChannelMonitor`s themselves in the coming commit(s), it is useful to have it in a struct instead.
When we claim an MPP payment, then crash before persisting all the relevant `ChannelMonitor`s, we rely on the payment data being available in the `ChannelManager` on restart to re-claim any parts that haven't yet been claimed. This is fine as long as the `ChannelManager` was persisted before the `PaymentClaimable` event was processed, which is generally the case in our `lightning-background-processor`, but may not be in other cases or in a somewhat rare race. In order to fix this, we need to track where all the MPP parts of a payment are in the `ChannelMonitor`, allowing us to re-claim any missing pieces without reference to any `ChannelManager` data. Further, in order to properly generate a `PaymentClaimed` event against the re-started claim, we have to store various payment metadata with the HTLC list as well. Here we take the first step, building a list of MPP parts and metadata in `ChannelManager` and passing it through to `ChannelMonitor` in the `ChannelMonitorUpdate`s.
When we claim an MPP payment, then crash before persisting all the relevant `ChannelMonitor`s, we rely on the payment data being available in the `ChannelManager` on restart to re-claim any parts that haven't yet been claimed. This is fine as long as the `ChannelManager` was persisted before the `PaymentClaimable` event was processed, which is generally the case in our `lightning-background-processor`, but may not be in other cases or in a somewhat rare race. In order to fix this, we need to track where all the MPP parts of a payment are in the `ChannelMonitor`, allowing us to re-claim any missing pieces without reference to any `ChannelManager` data. Further, in order to properly generate a `PaymentClaimed` event against the re-started claim, we have to store various payment metadata with the HTLC list as well. Here we store the required MPP parts and metadata in `ChannelMonitor`s and make them available to `ChannelManager` on load.
In a coming commit we'll use the existing `ChannelManager` claim flow to claim HTLCs which we found partially claimed on startup, necessitating having a full `ChannelManager` when we go to do so. Here we move the re-claim logic down in the `ChannelManager`-read logic so that we have that.
Here we wrap the logic which moves claimable payments from `claimable_payments` to `pending_claiming_payments` to a new utility function on `ClaimablePayments`. This will allow us to call this new logic during `ChannelManager` deserialization in a few commits.
In the next commit we'll start using (much of) the normal HTLC claim pipeline to replay payment claims on startup. In order to do so, however, we have to properly handle cases where we get a `DuplicateClaim` back from the channel for an inbound-payment HTLC. Here we do so, handling the `MonitorUpdateCompletionAction` and allowing an already-completed RAA blocker.
When we claim an MPP payment, then crash before persisting all the relevant `ChannelMonitor`s, we rely on the payment data being available in the `ChannelManager` on restart to re-claim any parts that haven't yet been claimed. This is fine as long as the `ChannelManager` was persisted before the `PaymentClaimable` event was processed, which is generally the case in our `lightning-background-processor`, but may not be in other cases or in a somewhat rare race. In order to fix this, we need to track where all the MPP parts of a payment are in the `ChannelMonitor`, allowing us to re-claim any missing pieces without reference to any `ChannelManager` data. Further, in order to properly generate a `PaymentClaimed` event against the re-started claim, we have to store various payment metadata with the HTLC list as well. Here we finally implement claiming using the new MPP part list and metadata stored in `ChannelMonitor`s. In doing so, we use much more of the existing HTLC-claiming pipeline in `ChannelManager`, utilizing the on-startup background events flow as well as properly re-applying the RAA-blockers to ensure preimages cannot be lost.
When we discover we've only partially claimed an MPP HTLC during `ChannelManager` reading, we need to add the payment preimage to all other `ChannelMonitor`s that were a part of the payment. We previously did this with a direct call on the `ChannelMonitor`, requiring users write the full `ChannelMonitor` to disk to ensure that updated information made it. This adds quite a bit of delay during initial startup - fully resilvering each `ChannelMonitor` just to handle this one case is incredibly excessive. Over the past few commits we dropped the need to pass HTLCs directly to the `ChannelMonitor`s using the background events to provide `ChannelMonitorUpdate`s insetad. Thus, here we finally drop the requirement to resilver `ChannelMonitor`s on startup.
Because the new startup `ChannelMonitor` persistence semantics rely on new information stored in `ChannelMonitor` only for claims made in the upgraded code, users upgrading from previous version of LDK must apply the old `ChannelMonitor` persistence semantics at least once (as the old code will be used to handle partial claims).
3e355b3    to
    ba26432      
    Compare
  
    | Squashed without further changes. | 
When we discover we've only partially claimed an MPP HTLC during
ChannelManagerreading, we need to add the payment preimage toall other
ChannelMonitors that were a part of the payment.We previously did this with a direct call on the
ChannelMonitor,requiring users write the full
ChannelMonitorto disk to ensurethat updated information made it.
This adds quite a bit of delay during initial startup - fully
resilvering each
ChannelMonitorjust to handle this one case isincredibly excessive.
Instead, we rewrite the MPP claim replay logic to use only (new) data included in
ChannelMonitors, which has a nice side-effect of teeing up futureChannelManager-non-persistence features as well as makes ourPaymentClaimedevent generation much more robust.